Abstract

We present the Ubuntu Chat Corpus as a data source for
multiparticipant chat analysis. This addresses the lack of
large, publicly available corpora suitable for research in this
medium. The advantages of using this corpus for research are
its large number of chat messages, its multiple languages, its
technical nature, and the public domain status of all of the
original chat messages.

Introduction 

Multiparticipant chat is a form of chat in which multiple
participants converse synchronously through text communications.
Examples of multiparticipant chat include Internet Relay Chat
(IRC), virtual game lounges (e.g., Battle.net, Steam), game
environments (e.g., MUDs, MMORPGs), and collaborative learning
environments. Growing interest in the past decade has looked at
problems such as thread disentanglement (Elsner and Charniak
2010), topic detection (Durham 2009; Trausan-Matu et al. 2007),
author profiling (Lin 2007; Kose, Ozyurt, and Ikibaş 2008), and
message attribute identification (Dela Rosa and Ellen 2009; Wu
et al. 2005). One thing this research area has been missing,
though, is a large, public corpus of chat messages. Having such
a corpus would allow for better comparison of different
techniques and standardization of evaluations, and would make it
easier for researchers to enter the field.

We describe the Ubuntu Chat Corpus as a data source for research
on multiparticipant chat analysis. This corpus consists of
messages from Ubuntu’s IRC support channels. In this paper, we
first describe how we constructed this corpus, followed by how
it compares with other chat data sources. We then conclude by
proposing some open research problems that can be investigated
using this corpus.

Background 

Chat has lacked large corpora available for public use despite
being an old medium: MUDs began in the 1970s and IRC was created
in 1988 (Herring in press; Reid 1991). There are some
comparatively small, annotated corpora being used for current
chat research, such as the NPS Chat Corpus (Forsyth and Martell
2007; Lin 2007), the #LINUX corpus (Elsner and Charniak 2010),
and the #IPHONE/#PHYSICS/#PYTHON corpus (Adams 2008). For many
other research investigations, though, the authors used either
archives of unknown copyright status or self-collected data that
were not made publicly available, making it difficult to compare
different techniques.

Corpus Description

Ubuntu, a Linux-based operating system, has multiple IRC
channels (found on the freenode IRC network) for technical
support and development coordination. Ubuntu started using IRC
in 2004 with one channel, #ubuntu (which is still its primary
support channel), and has since expanded to multiple channels
for more specific topics as well as support in non-English
languages. All messages are logged and kept in a public archive
at http://irclogs.ubuntu.com/.

We created a corpus from the logs dated 2004-07-05 through
2012-10-17 (the day before the release of Ubuntu 12.10). We
selected eleven frequently used channels from the archive,
including seven non-English channels; these are listed in
Table 1. We removed all system messages (e.g., users entering or
leaving a channel) except for messages that indicate a user
changing their nickname. We did this to make the logs
consistent: Ubuntu originally recorded all system messages but
later recorded only the nickname changes. All files in the
corpus are encoded in UTF-8, and the corpus was compressed from
2.9GB to 0.6GB. The corpus is available at
http://daviduthus.org/.
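
The filtering step described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used to build the corpus: it assumes that system messages begin with "===" and that nickname changes read "=== old is now known as new". The names NICK_CHANGE, keep_line, and filter_log are hypothetical.

```python
import re

# Assumed line shapes (not confirmed by the paper):
#   "[12:34] <nick> message text"         -> chat message (kept)
#   "=== old is now known as new"         -> nickname change (kept)
#   "=== nick [host] has joined #ubuntu"  -> other system message (dropped)
NICK_CHANGE = re.compile(r"^=== \S+ is now known as \S+")


def keep_line(line: str) -> bool:
    """Keep every non-system line, plus nickname-change system lines."""
    if not line.startswith("==="):
        return True
    return bool(NICK_CHANGE.match(line))


def filter_log(lines):
    """Return the log with all system messages except nickname changes removed."""
    return [line for line in lines if keep_line(line)]


if __name__ == "__main__":
    sample = [
        "[12:00] <alice> hi, my wifi is down",
        "=== bob [~bob@example.org] has joined #ubuntu",
        "=== bob is now known as bobby",
    ]
    for line in filter_log(sample):
        print(line)
```

Keeping nickname changes lets later analyses link messages written by the same user under different nicknames.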

Figure 1 shows the volume of messages over time in the channel
#ubuntu, visualizing the cyclical pattern of traffic seen in
Ubuntu’s support channels. As can be seen, the number of
messages spikes every six months, which coincides with Ubuntu’s
biannual releases. The greatest peak was on 2006-05-27, when
58 900 messages were recorded, or about 0.7 messages per second.
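
As a quick check of the arithmetic, the per-second rate follows from dividing the peak daily count by the number of seconds in a day. The variable names below are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400 seconds
peak_messages = 58_900          # peak daily count reported for #ubuntu

# Average rate over the whole day; traffic within the day is not uniform.
rate = peak_messages / SECONDS_PER_DAY
print(f"{rate:.2f} messages per second")
```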

An unfortunate trend seen in the graph is that the volume of
messages has decreased. This downward trend does not diminish
the validity of using this corpus, since freenode, which hosts
IRC channels for many open-source projects, has had an
increasing number of users in recent years (freenode 2012). Thus
research on this corpus benefits other open-source projects that
use IRC for their technical support.

Channel          Number of   Number of  Avg Msg  First
                 Messages    Users      Length   Logged      Description
#ubuntu          26 360 715    529 882     57.6  2004-05-07  Ubuntu’s primary support
#kubuntu          5 963 258    100 411     47.6  2005-03-31  Kubuntu (Ubuntu with KDE) support
#ubuntu-devel     2 112 074     12 140     53.7  2004-10-01  Developmental team coordination
#ubuntu+1         1 621 680     26 805     52.6  2007-04-04  Developmental versions’ support
#ubuntu-cn        1 641 416     11 162     21.7  2010-11-04  Support for Mainland China
#ubuntu-ru          883 662      8 320     40.6  2010-11-04  Support for Russia
#ubuntu-br          649 969      6 725     34.4  2010-11-04  Support for Brazil
#ubuntu-es          646 675      9 020     41.3  2010-11-04  Support for Spanish speakers
#ubuntu-it          645 375     10 316     47.0  2010-11-04  Support for Italy
#ubuntu-pl          635 873      3 467     33.1  2010-11-04  Support for Poland
#ubuntu-se          550 013      2 456     45.2  2010-11-04  Support for Sweden

Table 1: Description of the channels included in the Ubuntu Chat
Corpus. The average message length is defined as the average
number of characters excluding the timestamp and user’s nickname.
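
The average message length defined in the caption could be computed along the following lines. This is a sketch under the assumption that each chat line has the shape "[HH:MM] <nick> text"; the MESSAGE pattern and the function name are hypothetical, and the corpus files may differ in detail.

```python
import re

# Assumed line shape: "[HH:MM] <nick> message text".
# The capture group holds only the message body, so the timestamp and
# nickname are excluded from the character count.
MESSAGE = re.compile(r"^\[\d{2}:\d{2}\] <[^>]+> (.*)$")


def average_message_length(lines):
    """Mean number of characters per message body; 0.0 if there are none."""
    lengths = []
    for line in lines:
        match = MESSAGE.match(line)
        if match:  # lines that are not chat messages are skipped
            lengths.append(len(match.group(1)))
    return sum(lengths) / len(lengths) if lengths else 0.0


if __name__ == "__main__":
    sample = [
        "[12:00] <alice> hello",
        "[12:01] <bob> hi there",
        "=== bob is now known as bobby",
    ]
    print(average_message_length(sample))
```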


Figure 1: The daily volume of messages in #ubuntu.


Comparison with Other Chat Data

There are four reasons for using this chat corpus over other
collections of multiparticipant chat logs.1 First is the corpus
size: this is the largest collection of publicly available chat
logs that we are aware of.

Second, the original messages from the archive are in the public
domain, bypassing many legal issues. This is both stated on
Ubuntu’s website2 and sent as a message whenever a user enters
one of the logged channels. We have not found any other set of
chat logs that has been explicitly declared as public domain.

Third, the logs contain technical discussions, which allows for
research that is applicable to other technical domains (e.g.,
business, online courses, collaborative learning, military
command and control). There is one disadvantage to this: these
channels have less social chat than would be seen in a
non-technical chat channel.

Finally, the corpus contains channels in languages other than
English, yet the channels cover the same general topics as
discussed in the English channels. While there is no exact match
between messages or time-specific topics in the different
channels, one can assume that popular topics being discussed in
the main support channel (which is English-only) will probably
also be popular in the non-English channels.

1 These four claims come from our difficult experience of
finding a suitable chat corpus for our research.

2 https://help.ubuntu.com/community/InternetRelayChat/

Challenge Research Problems

We now describe some challenge research problems that can be
investigated using this chat corpus. We focus on problems that
have received minimal or no research attention in a
multiparticipant chat domain. To overcome these problems, either
techniques from other domains will need to be adapted or new
techniques will need to be created for this domain.

Intelligent Word Highlighting

Word highlighting helps users find messages of interest.
Unfortunately, current state-of-the-art word highlighting for
chat clients is rather simple: users enter a list of words they
want highlighted, and the client highlights only these specific
words. There are many problems with this: words can be
misspelled or abbreviated; words can be falsely highlighted due
to lack of context awareness; and relevant words may be missed
since it is difficult to predict all the words that one might
need highlighted. In addition, different channels (such as in
IRC) can have different meanings for the same word; e.g., the
term “unity” spoken in an Ubuntu-related channel would refer to
Ubuntu’s new user interface, while it would usually not have a
special meaning in a non-Ubuntu-related channel.

Given this difficulty of finding relevant messages in chat,
techniques are needed to aid users in filtering the messages.
For example, techniques that suggest words related to those a
user wants highlighted could help users find more relevant
messages and could integrate easily with current chat clients.
So far, a few researchers have investigated this problem from a
military perspective, with the goal of reducing information
overload for military personnel who use chat for command and
control communications (Berube et al. 2007; Budlong, Walter, and
Yilmazel 2009; Dela Rosa and Ellen 2009).
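
To make the limitation of word-list highlighting concrete, the sketch below contrasts exact matching with a simple string-similarity match that tolerates misspellings. The similarity match uses Python's standard difflib module and is only one possible baseline, not a technique proposed in this paper; the function names, the watch list, and the 0.8 cutoff are illustrative assumptions. It addresses misspellings only and does nothing for context awareness or channel-specific word senses.

```python
import difflib
import re


def tokenize(message):
    """Lowercase the message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", message.lower())


def exact_highlights(message, watch_words):
    """Word-list highlighting: only tokens that appear verbatim in the list."""
    return [tok for tok in tokenize(message) if tok in watch_words]


def fuzzy_highlights(message, watch_words, cutoff=0.8):
    """Also highlight tokens that closely resemble a watched word."""
    hits = []
    for tok in tokenize(message):
        # get_close_matches returns watched words whose similarity ratio
        # to the token is at least `cutoff`.
        if difflib.get_close_matches(tok, watch_words, n=1, cutoff=cutoff):
            hits.append(tok)
    return hits


if __name__ == "__main__":
    watch = ["wireless", "nvidia"]
    msg = "my wireles card and nvdia driver fail"
    print(exact_highlights(msg, watch))
    print(fuzzy_highlights(msg, watch))
```

In this example, exact matching misses both misspelled words while the similarity match catches them, at the cost of a cutoff that would need tuning to avoid false highlights.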


Intelligent Bots

Ubuntu uses bots to aid with running its IRC channels. One of
these bots, ubottu, contains a collection of factoids (short
messages), which can be used to answer other people’s questions.
While this is evidently helpful (ubottu has generated the most
messages in this corpus), the question-answering process is
unfortunately done manually; an experienced user must direct the
bot as to which question to answer and which factoid to use.

Essentially, the challenge is how to create an intelligent bot
that could confidently answer common questions correctly while
allowing more expert users to answer questions beyond its
capabilities. An even more difficult problem would be to create
a bot that can learn answers to new types of questions, such as
when new software has been introduced in Ubuntu. Some research
has investigated creating intelligent agents in a
multiparticipant chat domain, such as Cobot (Isbell et al.
2006). These agents, while limited in conversation ability, can
provide a starting point for more intelligent, interactive bots
in such a domain.

Automatic Chat Summarization

As previously shown, there are many years’ worth of chat
messages and knowledge archived by Ubuntu which, as far as we
know, are not being reused outside of human-made factoids for
ubottu. Applying summarization techniques to summarize answers
to frequent questions would be beneficial, as this would allow
this knowledge to be reused. Such summaries could also be used
by intelligent bots to create answers to new types of questions.
An advantage of this chat corpus is that there are already
human-authored summaries in the form of factoids that can be
used as gold standards for evaluations.

There has been little research on automatic multiparticipant
chat summarization. Zhou and Hovy (2005) investigated
summarizing chat messages with extractive methods to create
summaries similar to human-made summaries. We recently described
our goals for investigating how to summarize conversation
threads of chat messages (Uthus and Aha 2011).

Multi-Language Techniques

Most research on multiparticipant chat has used chat logs whose
messages are in only a single language, with most focusing on
English. This corpus provides a great resource for investigating
techniques on non-English languages and for investigating
techniques that are language-independent, such as for thread
disentanglement.

Another research problem on multiple languages in chat is
translating chat messages. #ubuntu is visited by far more users
than any of the non-English channels, so it is easy for a user
to receive help if they are fluent in English. For those who are
not, there might not be many expert users who write in their
native language, which can make it difficult to receive any
help. Machine translation could help overcome this problem. So
far, there has been some limited work focused on
multiparticipant chat translation (Calefato, Lanubile, and
Minervini 2010; Yamashita et al. 2009; Yoshino and Ikenobu
2010), but these studies only examined users chatting in small
group settings on constrained tasks.

In relation to intelligent bots, this corpus can be used to
detect non-English messages in Ubuntu’s English support
channels. This can then aid in directing users to more
appropriate channels. Recent similar work has been reported on
detecting the language of tweets for creating language-specific
Twitter collections (Bergsma et al. 2012), which can be used as
a starting point due to some of the similarities between
microblogs and chat.
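
As a rough illustration of the detection task, a naive baseline might count common function words per language and pick the language with the most hits. This is not the method of Bergsma et al. (2012), and the tiny word lists, the guess_language name, and the min_hits threshold are illustrative assumptions; short, noisy chat messages would likely need a stronger approach.

```python
# Tiny, hand-picked function-word lists; purely illustrative.
STOPWORDS = {
    "en": {"the", "is", "and", "to", "how", "my", "not", "i"},
    "es": {"el", "la", "que", "no", "como", "mi", "es", "de"},
    "it": {"il", "che", "non", "come", "mio", "di", "per", "una"},
}


def guess_language(message, min_hits=2):
    """Return the best-scoring language code, or None if evidence is weak."""
    tokens = message.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None


if __name__ == "__main__":
    print(guess_language("how do i mount the drive"))
    print(guess_language("como puedo instalar el driver de mi tarjeta"))
```

A bot could use a non-English guess in an English-only channel as a cue to point the user toward the matching language channel, and fall back to doing nothing when the function returns None.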

Conclusions 

We have presented the Ubuntu Chat Corpus as a data source for
research on multiparticipant chat analysis. It has many benefits
that make it useful for research in this medium: its large size,
its public domain status, its technical discussions, and its
chat logs in non-English languages. We have also described some
challenging problems in multiparticipant chat analysis that have
received little research attention and that would be suitable to
investigate with this corpus.